usersDF.write.format("orc")
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.option("orc.column.encoding.direct", "name")
.save("users_with_options.orc")
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo
A standard technique from the hashing literature is to use two hash functions h1(x) and h2(x) to simulate additional hash functions of the form g_i(x) = h1(x) + i * h2(x). We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically, only two hash functions are necessary to implement a Bloom filter effectively, with no loss in the asymptotic false positive probability. This leads to less computation and potentially less need for randomness in practice.
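The double-hashing scheme above can be sketched as a minimal, self-contained Bloom filter in Scala. This is a hypothetical illustration of the g_i(x) = h1(x) + i * h2(x) construction, not Spark's or ORC's actual implementation; it assumes MurmurHash3 evaluated with two different seeds supplies the base hashes h1 and h2.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative Bloom filter with m bits and k probe positions, where all k
// positions are derived from just two base hashes via g_i = h1 + i * h2 mod m.
class DoubleHashBloomFilter(m: Int, k: Int) {
  private val bits = new Array[Boolean](m)

  // Two base hashes of the same key, obtained by seeding MurmurHash3 differently.
  private def baseHashes(x: String): (Int, Int) =
    (MurmurHash3.stringHash(x, 0), MurmurHash3.stringHash(x, 1))

  // The i-th simulated hash function; floorMod keeps the index non-negative.
  private def index(h1: Int, h2: Int, i: Int): Int =
    Math.floorMod(h1 + i * h2, m)

  def add(x: String): Unit = {
    val (h1, h2) = baseHashes(x)
    for (i <- 0 until k) bits(index(h1, h2, i)) = true
  }

  // May return false positives, but never false negatives for added keys.
  def mightContain(x: String): Boolean = {
    val (h1, h2) = baseHashes(x)
    (0 until k).forall(i => bits(index(h1, h2, i)))
  }
}
```

Because each probe position is a linear combination of h1 and h2, insertion and lookup compute only two hashes per key regardless of k, which is where the savings in computation come from.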